1 Introduction

1.1 Libraries we will use

Load packages:

library(tidyverse)
library(stringr)  # package for manipulating strings (part of tidyverse)

1.2 Dataset we will use

We used rtweet to pull Twitter data from the PAC-12 universities. We used the university admissions Twitter handle if there was one, or the university’s main Twitter handle if there wasn’t:

# library(rtweet)
# 
# p12 <- c("uaadmissions", "FutureSunDevils", "caladmissions", "UCLAAdmission",
#          "futurebuffs", "uoregon", "BeaverVIP", "USCAdmission",
#          "engagestanford", "UtahAdmissions", "UW", "WSUPullman")
# p12_full_df <- search_tweets(paste0("from:", p12, collapse = " OR "), n = 500)
#
# saveRDS(p12_full_df, "p12_dataset.RDS")

# Load previously pulled Twitter data
p12_url <- "https://github.com/anyone-can-cook/rclass2/raw/main/data/recruiting/p12_dataset.RDS"
p12_full_df <- readRDS(url(p12_url, "rb"))

# Use subset of data
p12_df <- p12_full_df %>% select("user_id", "created_at", "screen_name", "text", "location")
head(p12_df)
## # A tibble: 6 x 5
##   user_id  created_at          screen_name text                 location   
##   <chr>    <dttm>              <chr>       <chr>                <chr>      
## 1 22080148 2020-04-25 22:37:18 WSUPullman  "Big Dez is headed … Pullman, W…
## 2 22080148 2020-04-23 21:11:49 WSUPullman  Cougar Cheese. That… Pullman, W…
## 3 22080148 2020-04-21 04:00:00 WSUPullman  "Darien McLaughlin … Pullman, W…
## 4 22080148 2020-04-24 03:00:00 WSUPullman  6 houses, one pick.… Pullman, W…
## 5 22080148 2020-04-20 19:00:21 WSUPullman  Why did you choose … Pullman, W…
## 6 22080148 2020-04-20 02:20:01 WSUPullman  Tell us one of your… Pullman, W…

2 Lecture overview

For a review of string basics, see the first strings lecture.

Credit: Regex Humor (Rex Egg)

In her popular STAT545 class Jenny Bryan, professor of statistics at University of British Columbia, describes regular expressions (regex) as:

A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as “write only”, because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.

Yes, learning regular expressions is painful. So why are we making you do this? Because regular expressions are a fundamental building block of data science.


An annoying thing people say is that data science is about trying to find the “signal in the noise.”

  • Noisy data “is data with a large amount of additional meaningless information in it called noise” (Wikipedia)
  • Prior to the data science revolution, quant people thought of “data” as something in columns and rows
  • The data science revolution is about creating analysis datasets from many pieces of structured, semi-structured, and unstructured data
  • But processing all this semi-structured data requires a lot of complex (and often tedious) data manipulation


Another annoying thing people say is that “data science is 80% data cleaning and 20% analysis.”

Much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

But if the value [of data science] comes from combining different data sets, so does the headache. Data from sensors, documents, the web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.

“Practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place,” said Matt Mohebbi, a data scientist and co-founder of Iodine.


So why learn regular expressions? Because regular expressions are THE preeminent tool for identifying data patterns and for cleaning/transforming “noisy” data.

  • Most programmers I speak to talk about regular expressions as one of the most important tools for a programmer to learn
  • One could argue that regular expressions are a fundamental driver of the data science revolution, in that they are what made it possible to format and integrate diverse data sources into analysis datasets (I don’t know if that is true, but it seems reasonable!)
  • For example, web-scraping is fundamentally an application of regular expressions. Grabbing data from the internet is usually very easy. The hard part is processing all that html code into something that can be analyzed.


3 What are regular expressions?

What are regular expressions? (Geeks for Geeks)

  • Regular expressions are an efficient way to match different patterns in strings, similar to the ctrl+f or cmd+f function you use to find text in a pdf or word document

  • For example, regex can be used to match all cases of the exact text "out-of-state". But what makes it so powerful is that we could also have it match different variations or patterns, like "Out-of-state", "out of state", etc.

  • In the next subsection, we’ll introduce the str_view() & str_view_all() functions from the stringr package (part of tidyverse) to help us visualize what is being matched with our regular expressions
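As a minimal sketch of the "out-of-state" idea above: the example strings and variable names below are made up for illustration (they are not from the lecture data), and the pieces of the pattern (`[...]` sets and `regex()` options) are explained in later sections. It uses str_detect() rather than str_view() so the result prints as plain text:

```r
library(stringr)

# Hypothetical example strings (not from the p12 dataset)
phrases <- c("out-of-state", "Out-of-state", "out of state", "in-state")

# [- ] matches either a hyphen or a space;
# ignore_case = TRUE lets "Out" match "out"
oos_pattern <- regex("out[- ]of[- ]state", ignore_case = TRUE)

is_oos <- str_detect(phrases, oos_pattern)
is_oos  # TRUE TRUE TRUE FALSE
```

A single pattern covers all three spellings, which a plain ctrl+f search for "out-of-state" would not.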

Credit: Crystal Han, Ozan Jaquette, & Karina Salazar (Recruiting the Out-Of-State University)


3.1 str_view() & str_view_all()


The str_view() & str_view_all() functions:

?str_view
?str_view_all

# SYNTAX AND DEFAULT VALUES
str_view(string, pattern, match = NA)
str_view_all(string, pattern, match = NA)
  • Function:
    • str_view() shows the first match of a regex pattern
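    • str_view_all() shows all the matches of a regex pattern
    • Note: this describes stringr versions before 1.5.0. In stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated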
  • Arguments:
    • string: Input vector. Either a character vector, or something coercible to one.
    • pattern: Pattern to look for.
      • The default interpretation is a regular expression, as described in stringi::stringi-search-regex. Control options with regex().
    • match: If TRUE, shows only strings that match the pattern. If FALSE, shows only the strings that don’t match the pattern. Otherwise (the default, NA) displays both matches and non-matches.


Example: Using str_view() & str_view_all() to match literal text

Let’s use these functions to match the exact string "Co" from one of the tweets in our p12_df dataframe. str_view() will show us the first pattern match. Notice that the pattern is case-sensitive, as the "co" in "colleagues" was not matched:

str_view(string = p12_df$text[119], pattern = 'Co')


We can use str_view_all() to show all matches, not just the first match:

str_view_all(string = p12_df$text[119], pattern = 'Co')

3.2 Backslash (\) escape character

“A sequence in a string that starts with a \ is called an escape sequence and allows us to include special characters in our strings.”

Credit: Escape sequences from DataCamp


The backslash (\) is an escape character: it changes how the character that follows it is interpreted. For example, recall how we can use \ to escape quotes in a string:

my_string <- 'Escaping single quote \' within single quotes'
my_string
## [1] "Escaping single quote ' within single quotes"

The alternative would be to create the string using double quotes:

my_string <- "Single quote ' within double quotes does not need escaping"
my_string
## [1] "Single quote ' within double quotes does not need escaping"


Similarly, to include a literal backslash in the string, we need to escape the backslash with another backslash:

my_string <- "The executable is located in C:\\Program Files\\Git\\bin"
my_string
## [1] "The executable is located in C:\\Program Files\\Git\\bin"

We can use writeLines() to see the escaped string:

writeLines(my_string)
## The executable is located in C:\Program Files\Git\bin


The writeLines() function:

?writeLines

# SYNTAX AND DEFAULT VALUES
writeLines(text, con = stdout(), sep = "\n", useBytes = FALSE)
  • Function: “writeLines() displays quotes and backslashes as they would be read, rather than as R stores them.” (From writeLines documentation)
    • When we include escape sequences in the string, it is helpful to use writeLines() to see how the escaped string looks
    • writeLines() will also output the string without the outer pair of double quotes that R adds when printing it, so we only see the content of the string
  • Arguments:
    • text: Character vector containing the text you want to display


The backslash (\) can also be used to form special characters, such as \n (newline character) and \t (tab character).

These characters following a backslash \ take on new meaning. For example, the n by itself is just a literal n. When you add a backslash to n, you are making it a special character where \n now represents a newline.

my_string <- "A\tB\nC\tD"
my_string
## [1] "A\tB\nC\tD"

Use writeLines() to see the escaped string:

writeLines(my_string)
## A    B
## C    D


Summary: We can use the backslash (\) escape character for:

  • \: escaping purposes
    • \': literal single quote
    • \": literal double quote
    • \\: literal backslash
  • \n: newline
  • \t: tab
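One way to convince yourself that each escape sequence in the summary above is stored as a single character is to count characters with base R's nchar(). This is only a small sketch; the variable names are made up for illustration:

```r
# Each escape sequence is two keystrokes in the source code
# but only one character in the stored string
esc_lengths <- nchar(c("\\", "\n", "\t", "\'"))
esc_lengths  # 1 1 1 1

# "A\tB\nC\tD" holds 4 letters, 2 tabs, and 1 newline
n_chars <- nchar("A\tB\nC\tD")
n_chars  # 7
```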

3.3 Backslashes in regular expressions

If \ is used as an escape character in regular expressions, how do you match a literal \? Well you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write \\\\ — you need four backslashes to match one!

Credit: R for Data Science Strings Chapter


In regular expressions, the backslash \ also serves as an escape character, so in order to match a literal \, we need to use the regular expression \\. For example, let’s say we want to match the \ in the following string:

my_string <- "The executable is located in C:\\Program Files\\Git\\bin"

# Use writeLines() to see escaped string
writeLines(my_string)
## The executable is located in C:\Program Files\Git\bin

The regular expression we need is \\. But this doesn’t work:

# This will give an error if we try to run it
str_view_all(string = my_string, pattern = "\\")

Why is that? Let’s take a look at what is happening with the string "\\" we are providing as the pattern argument:

# Use writeLines() to see the escaped string
writeLines("\\")
## \

As seen, once escaped, the string "\\" becomes \ - so we were providing \ as the regular expression (i.e., pattern argument) instead of the \\ that we wanted. In order to get \\, we need to use the string "\\\\", where the 1st \ escapes the 2nd and the 3rd \ escapes the 4th:

# Use writeLines() to see the escaped string
writeLines("\\\\")
## \\
# This properly matches the `\` in the string
str_view_all(string = my_string, pattern = "\\\\")


Summary: Whenever we need to use backslash in our regular expression, we’ll need to escape the backslash (by using another backslash) in the string that we provide as the regex pattern. For example, to match a newline character \n we need to use "\\n", to match a tab character \t we need to use "\\t", etc.
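The four-backslash rule can be checked without str_view_all() by counting matches with str_count(). The sketch below also shows stringr's fixed() helper as an alternative when we only want a literal match; `my_path`, `n_regex`, and `n_fixed` are names made up for this illustration:

```r
library(stringr)

# Stored content: C:\Program Files\Git\bin (three backslashes)
my_path <- "C:\\Program Files\\Git\\bin"

# The string "\\\\" is the regex \\, which matches one literal backslash
n_regex <- str_count(my_path, "\\\\")

# fixed() treats the pattern as literal text rather than a regex,
# so only the string-level escaping is needed
n_fixed <- str_count(my_path, fixed("\\"))

c(n_regex, n_fixed)  # 3 3
```

Using fixed() sidesteps the regex layer of escaping, at the cost of losing all regex features for that pattern.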

4 Regular expression characters

Some common regular expression patterns include (this is not an exhaustive list):

  • Character classes
  • Quantifiers
  • Anchors
  • Sets and ranges
  • Groups and backreferences

Credit: DaveChild Regular Expression Cheat Sheet


4.1 Character classes

STRING                                 REGEX                                  MATCHES
(type string that represents regex)    (to have this appear in your regex)    (to match with this text)
"\\d"                                  \d                                     any digit
"\\D"                                  \D                                     any non-digit
"\\s"                                  \s                                     any whitespace
"\\S"                                  \S                                     any non-whitespace
"\\w"                                  \w                                     any word character
"\\W"                                  \W                                     any non-word character

Other regex involving backslashes…
"\\n"                                  \n                                     newline
"\\t"                                  \t                                     tab
"\\\\"                                 \\                                     \
"\\."                                  \.                                     .
"\\?"                                  \?                                     ?
"\\("                                  \(                                     (
"\\)"                                  \)                                     )
"\\{"                                  \{                                     {
"\\}"                                  \}                                     }

Credit: Working with strings in stringr Cheat sheet


There are certain character classes in regular expression that have special meaning. For example, \d is used to match any digit (i.e., number), \s is used to match any whitespace (i.e., space, tab, or newline character), and \w is used to match any word character (i.e., alphanumeric character or underscore).

“But wait… there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to prepend two backslashes…” This means in R, you would write out the regex patterns as "\\d", "\\s", "\\w", etc.

Credit: Escaping sequences from Stat 545


Example: Using \d & \D to match digits & non-digits

We can use \d to match all instances of a digit (i.e., number):

# The escaped string "\\d" results in the regex \d
writeLines("\\d")
## \d
# Match any instances of a digit
str_view_all(string = p12_df$text[119], pattern = "\\d")


We can use \D to match all instances of a non-digit character:

# The escaped string "\\D" results in the regex \D
writeLines("\\D")
## \D
# Match any instances of a non-digit
str_view_all(string = p12_df$text[119], pattern = "\\D")


This matches all instances of a digit followed by a non-digit character:

str_view_all(string = p12_df$text[119], pattern = "\\d\\D")

Example: Using \s & \S to match whitespace & non-whitespace

We can use \s to match all instances of a whitespace (i.e., space, tab, or newline character):

# The escaped string "\\s" results in the regex \s
writeLines("\\s")
## \s
# Match any instances of a whitespace
str_view_all(string = p12_df$text[119], pattern = "\\s")


We can use \S to match all instances of a non-whitespace character:

# The escaped string "\\S" results in the regex \S
writeLines("\\S")
## \S
# Match any instances of a non-whitespace
str_view_all(string = p12_df$text[119], pattern = "\\S")


This matches all instances of the letter e followed by a whitespace character:

str_view_all(string = p12_df$text[39], pattern = "e\\s")

Example: Using \w & \W to match words & non-words

We can use \w to match all instances of a word character (i.e., alphanumeric character or underscore):

# The escaped string "\\w" results in the regex \w
writeLines("\\w")
## \w
# Match any instances of a word character
str_view_all(string = p12_df$text[119], pattern = "\\w")


We can use \W to match all instances of a non-word character:

# The escaped string "\\W" results in the regex \W
writeLines("\\W")
## \W
# Match any instances of a non-word character
str_view_all(string = p12_df$text[119], pattern = "\\W")


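This matches 3-character words that have a non-word character on both sides. Note that a word at the very start or end of the string is missed, and that \w also matches digits and underscores: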

str_view_all(string = p12_df$text[119], pattern = "\\W\\w\\w\\w\\W")


The second half of the table above shows other regular expressions involving backslashes. This includes special characters like \n and \t, as well as using backslash to escape characters that have special meanings in regex, like . or ? (as we will soon see). So to match a literal period or question mark, we need to use the regex \. and \?, or strings "\\." and "\\?" in R.
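The character classes and escaped metacharacters above can be tried on a short string where the expected matches are easy to verify by eye. This is a minimal sketch on a made-up string (not one of the tweets), using str_extract_all() and str_count() so the output is plain text:

```r
library(stringr)

# Hypothetical example string
x <- "Room 12B, floor 3."

# \d matches each digit separately
digits <- str_extract_all(x, "\\d")[[1]]
digits  # "1" "2" "3"

# \d\D matches a digit followed by a non-digit
digit_then_other <- str_extract_all(x, "\\d\\D")[[1]]
digit_then_other  # "2B" "3."

# \s counts the three spaces
n_space <- str_count(x, "\\s")

# An escaped period matches only the literal "." ...
n_period <- str_count(x, "\\.")

# ... while an unescaped period matches every character
n_any <- str_count(x, ".")

c(n_space, n_period, n_any)  # 3 1 18
```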


4.2 Quantifiers

Character    Description
*            0 or more
?            0 or 1
+            1 or more
{3}          Exactly 3
{3,}         3 or more
{3,5}        3, 4, or 5


We can use quantifiers to specify how many times a character or expression should be matched. The quantifier directly follows the pattern it quantifies. For example, s? matches 0 or 1 s and \d{4} matches exactly 4 digits.


Example: Using the *, ?, and + quantifiers

We can use * to match 0 or more of a pattern:

# Matches all instances of `s` followed by 0 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W*")


We can use ? to match 0 or 1 of a pattern:

# Matches all instances of `s` followed by 0 or 1 non-word character
str_view_all(string = p12_df$text[119], pattern = "s\\W?")


We can use + to match 1 or more of a pattern:

# Matches all instances of `s` followed by 1 or more non-word characters
str_view_all(string = p12_df$text[119], pattern = "s\\W+")
# Matches all hashtags
str_view_all(string = p12_df$text[119], pattern = "#\\w+")

Example: Using {...} to specify how many occurrences to match

We can use {n} to specify the exact number of characters or expressions to match:

# Matches words with exactly 3 letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3}\\s")


We can use {n,} to specify n as the minimum number of occurrences to match:

# Matches words with 3 or more letters
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")


We can use {n,m} to match between n and m occurrences (inclusive):

# Matches words with between 3 to 5 letters (inclusive)
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,5}\\s")
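To see the quantifiers with outputs that can be checked by hand, here is a small sketch on made-up strings (the strings and variable names are illustrative assumptions, not lecture data):

```r
library(stringr)

# ? makes the preceding character optional (0 or 1 "u")
has_color <- str_detect(c("color", "colour", "colouur"), "colou?r")
has_color  # TRUE TRUE FALSE

# + requires at least one word character after the "#",
# so the lone "#" is not matched
tags <- str_extract_all("#GoHuskies beat # and #1", "#\\w+")[[1]]
tags  # "#GoHuskies" "#1"

# {4} matches exactly four digits in a row
year <- str_extract("Class of 2020, room 12", "\\d{4}")
year  # "2020"

# {2,3} is greedy: it takes 3 when it can, otherwise 2
runs <- str_extract_all("a aa aaa aaaa", "a{2,3}")[[1]]
runs  # "aa" "aaa" "aaa"
```

In the last example the run of four a's yields only one match of length 3, because the single leftover a is too short to match a{2,3}.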



4.3 Anchors

String    Character    Description
          ^            Start of string, or start of line in multi-line pattern
          $            End of string, or end of line in multi-line pattern
"\\b"     \b           Word boundary
"\\B"     \B           Non-word boundary


We can use anchors to indicate which part of the string to match. For example, ^ matches the start of the string, $ matches the end of the string, \b can be used to help detect word boundaries, and \B can be used to help match characters within a word.


Example: Using ^ & $ to match start & end of string

We can use ^ to match the start of a string:

# Matches only the quotation mark at the start of the text and not the end quote
str_view_all(string = p12_df$text[119], pattern = '^"')


We can use $ to match the end of a string:

# Matches only the number at the end of the text and not any other numbers
str_view_all(string = p12_df$text[119], pattern = "\\d$")

Example: Using \b & \B to match word boundary & non-word boundary

We can use \b to help detect word boundaries:

# Matches words with 3 or more letters using \b
str_view_all(string = p12_df$text[119], pattern = "\\b\\w{3,}\\b")

Notice how this is much more flexible than trying to use whitespace (\s) to determine word boundaries:

# Matches words with 3 or more letters using \s
str_view_all(string = p12_df$text[119], pattern = "\\s\\w{3,}\\s")


We can use \B to help match characters within a word:

# Matches only the letter `s` within a word and not at the start or end
str_view_all(string = p12_df$text[119], pattern = "\\Bs\\B")
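The difference between the anchors is easier to see on a small vector where we know the answers in advance. The following is a sketch on made-up strings (all names below are illustrative):

```r
library(stringr)

# Hypothetical example strings
x <- c("cat", "concat", "cat food", "bobcat")

starts_with_cat <- str_detect(x, "^cat")       # TRUE FALSE TRUE FALSE
ends_with_cat   <- str_detect(x, "cat$")       # TRUE TRUE FALSE TRUE
cat_as_word     <- str_detect(x, "\\bcat\\b")  # TRUE FALSE TRUE FALSE

# \b finds every word, including those at the start and end of the string
words_b <- str_extract_all("one two three", "\\b\\w{3,}\\b")[[1]]
words_b  # "one" "two" "three"

# \s requires (and captures) whitespace on both sides, so only the
# middle word matches, and the match includes the surrounding spaces
words_s <- str_extract_all("one two three", "\\s\\w{3,}\\s")[[1]]
words_s  # " two "
```

Because \b matches a position rather than a character, it does not consume the spaces between words, which is why it can find adjacent words that \s misses.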



4.4 Sets and ranges

Character    Description
.            Match any character except newline (\n)
a|b          Match a or b
[abc]        Match either a, b, or c
[^abc]       Match anything except a, b, or c
[a-z]        Match range of lowercase letters from a to z
[A-Z]        Match range of uppercase letters from A to Z
[0-9]        Match range of numbers from 0 to 9


The table above lists more ways that regular expressions give us flexibility in what we match. The period . acts as a wildcard to match any character except newline. The vertical bar | is similar to an OR operator. Square brackets [...] can be used to specify a set or range of characters to match (or not to match).


Example: Using . as a wildcard

We can use . to match any character except newline (\n):

# Matches any character except newline
str_view_all(string = p12_df$text[119], pattern = ".")


We can confirm there is a newline in the tweet above by using writeLines():

writeLines(p12_df$text[119])
## "I stand with my colleagues at @UW and America's leading research universities as they take fight to Covid-19 in our labs and hospitals."
## 
## #ProudToBeOnTheirTeam x #AlwaysCompete x #GoHuskies https://t.co/4YSf4SpPe0

Example: Using | as an OR operator

We can use | to match either one of multiple patterns:

# Matches `research`, `fight`, or `labs`
str_view_all(string = p12_df$text[119], pattern = "research|fight|labs")
# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "@\\w+|#\\w+")

Example: Using [...] to match (or not match) a set or range of characters

We can use [...] to match any set of characters:

# Matches hashtags or handles
str_view_all(string = p12_df$text[119], pattern = "[@#]\\w+")
# Matches any 2 consecutive vowels
str_view_all(string = p12_df$text[119], pattern = "[aeiouAEIOU]{2}")


We can also use [...] to match any range of alpha or numeric characters:

# Matches only lowercase x through z or uppercase A through C
str_view_all(string = p12_df$text[119], pattern = "[x-zA-C]")
# Matches only numbers 1 through 4 or the pound sign
str_view_all(string = p12_df$text[119], pattern = "[1-4#]")


We can use [^...] to indicate we do not want to match the provided set or range of characters:

# Matches anything except vowels
str_view_all(string = p12_df$text[119], pattern = "[^aeiouAEIOU]")
# Matches anything that's not uppercase letters
str_view_all(string = p12_df$text[119], pattern = "[^A-Z]+")

Notice that [...] only matches a single character (see second to last example above). We need to use quantifiers if we want to match a stretch of characters (see last example above).
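Here is a compact sketch of the same ideas on made-up strings, using str_extract_all() so the matches print as plain text (the strings and variable names are illustrative assumptions):

```r
library(stringr)

# Hypothetical example string
x <- "Call 555-0199 or tweet @UW #GoHuskies"

# A set followed by a quantifier: "@" or "#", then one or more word characters
mentions <- str_extract_all(x, "[@#]\\w+")[[1]]
mentions  # "@UW" "#GoHuskies"

# A range followed by a quantifier: runs of digits
nums <- str_extract_all(x, "[0-9]+")[[1]]
nums  # "555" "0199"

# Alternation and a set can express the same choice
alt <- str_extract_all("gray grey griy", "gray|grey")[[1]]
set <- str_extract_all("gray grey griy", "gr[ae]y")[[1]]
alt  # "gray" "grey"

# The wildcard matches "a" and "b" but skips the newline between them
dots <- str_extract_all("a\nb", ".")[[1]]
dots  # "a" "b"
```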



4.5 Groups and backreferences

String    Character    Description
          (...)        Capturing group
          (?:...)      Non-capturing group
"\\1"     \1           Part of the string matched by capturing group 1
"\\2"     \2           Part of the string matched by capturing group 2


Parentheses can be used to group parts of our regular expression together. Normal parentheses (...) create what is called a numbered capturing group. “A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses”. For example, if we have (\d), we can refer back to the digit matched by this capturing group using backreferences, like \1.

Credit: Hadley Wickham (R for Data Science) Grouping and backreferences

If we only want to use parentheses for grouping purposes and do not need to reference the matched values, we can use a non-capturing group (?:...).


Example: Using capturing groups (...) and backreferences

We can use capturing groups (...) to match certain patterns, then reference what was matched:

# Matches any letter that is repeated 2 times in a row
str_view_all(string = p12_df$text[119], pattern = "([A-Za-z])\\1")
# Matches any string of characters where the first and last letters are the same,
# and the second and second-to-last letters are the same
str_view_all(string = p12_df$text[119], pattern = "([a-z])([a-z]).*\\2\\1")

Example: Using non-capturing groups (?:...) for grouping purposes

We can use non-capturing groups (?:...) if we just want to group certain parts of the regex but don’t need to reference the matched value:

# Matches one or more of a digit followed by 3 letters
str_view_all(string = p12_df$text[119], pattern = "(?:\\d[A-Za-z]{3})+")


Normal parentheses (capturing groups) can still work for general grouping purposes too. But if you want to group things together without capturing them, you can just use non-capturing groups:

# Here, we have 2 capturing groups but only need to reference the 2nd
str_view_all(string = "A1A1A1eeee", pattern = "([A-Z]\\d)+([a-z])\\2{2}")
# So we can just turn the first group into a non-capturing group
str_view_all(string = "A1A1A1eeee", pattern = "(?:[A-Z]\\d)+([a-z])\\1{2}")
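The sketch below repeats these group patterns on short made-up strings so the matches can be verified by hand. It also shows a backreference in a replacement string via str_replace(), a stringr function that is not introduced in this part of the lecture; treat that line as an illustrative aside:

```r
library(stringr)

# ([a-z])\1 : any lowercase letter immediately followed by itself
doubled <- str_extract_all("hello bookkeeper", "([a-z])\\1")[[1]]
doubled  # "ll" "oo" "kk" "ee"

# Two groups referenced in reverse order: "ab" ... "ba"
mirrored <- str_extract("abcba xyz", "([a-z])([a-z]).*\\2\\1")
mirrored  # "abcba"

# The first group is non-capturing, so ([a-z]) is group 1.
# The match needs the letter plus 2 repeats, so 3 of the 4 e's are matched
nc <- str_extract("A1A1A1eeee", "(?:[A-Z]\\d)+([a-z])\\1{2}")
nc  # "A1A1A1eee"

# Backreferences also work in the replacement of str_replace()
swapped <- str_replace("Bryan, Jenny", "(\\w+), (\\w+)", "\\2 \\1")
swapped  # "Jenny Bryan"
```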



5 Regex with stringr functions

Using regex in stringr functions (From R for Data Science)

  • When we specify a pattern in a stringr function, such as str_view(), it is automatically wrapped in a call to regex() (i.e., treated as a regular expression)

    # This function call:
    str_view(string = "Turn to page 394...", pattern = "\\d+")
    # Is shorthand for:
    str_view(string = "Turn to page 394...", pattern = regex("\\d+"))
  • For simplicity, we can omit the call to regex()

  • But, there are additional arguments we can supply to regex() if we wanted

    • regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, ...)
    • ignore_case: If TRUE, allows characters to match either their uppercase or lowercase forms
    • multiline: If TRUE, allows ^ and $ to match the start and end of each line rather than the start and end of the complete string
    • comments: If TRUE, allows you to use comments and whitespace to make complex regular expressions more understandable
      • Spaces are ignored, as is everything after #
      • To match a literal space, you’ll need to escape it: "\\ "

Example: Specifying ignore_case = TRUE in regex()

Let’s say we have the following string:

s <- "Yay, yay.... YAY!"
s
## [1] "Yay, yay.... YAY!"


We can match all the yay’s using the following regex:

str_view_all(string = s, pattern = "[Yy][Aa][Yy]")


Equivalently, we can specify ignore_case = TRUE to avoid dealing with casing variations:

str_view_all(string = s, pattern = regex("yay", ignore_case = TRUE))
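The other two regex() arguments described above, multiline and comments, can be sketched in the same way. The strings and variable names here are made up for illustration:

```r
library(stringr)

# A string containing two lines
x <- "line one\nrow two"

# By default, ^ only matches at the start of the whole string
first_default <- str_extract_all(x, "^\\w+")[[1]]
first_default  # "line"

# With multiline = TRUE, ^ also matches at the start of each line
first_multi <- str_extract_all(x, regex("^\\w+", multiline = TRUE))[[1]]
first_multi  # "line" "row"

# With comments = TRUE, whitespace and anything after # are ignored,
# so a pattern can be spread over several annotated lines
phone <- regex("
  \\d{3}  # first three digits
  -       # a literal hyphen
  \\d{4}  # last four digits
", comments = TRUE)

number <- str_extract("Call 555-0199 now", phone)
number  # "555-0199"
```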

5.1 str_detect()


The str_detect() function:

?str_detect

# SYNTAX AND DEFAULT VALUES
str_detect(string, pattern, negate = FALSE)
  • Function: Detects the presence or absence of a pattern in a string
    • Returns logical vector (TRUE if there is a match, FALSE if there is not)
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • negate: If set to TRUE, the returned logical vector will contain TRUE if there is not a match and FALSE if there is one

Example: Using str_detect() on string

# Detects if there is a digit in the string
str_detect(string = "P. Sherman 42 Wallaby Way", pattern = "\\d")
## [1] TRUE

Example: Using str_detect() on character vector

# Detects if there is a digit in each string in the vector
str_detect(string = c("One", "25th", "3000"), pattern = "\\d")
## [1] FALSE  TRUE  TRUE
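As a quick sketch of the negate argument on the same vector: setting negate = TRUE gives the same result as negating the returned logical vector with `!` (variable names are made up for illustration):

```r
library(stringr)

x <- c("One", "25th", "3000")

# TRUE where there is NO digit in the string
no_digit <- str_detect(string = x, pattern = "\\d", negate = TRUE)
no_digit  # TRUE FALSE FALSE

# Same result as flipping the default output
same_as_not <- identical(no_digit, !str_detect(string = x, pattern = "\\d"))
same_as_not  # TRUE
```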

Example: Using str_detect() on dataframe column

Let’s create new columns in p12_df called is_am and is_pm that indicate whether each tweet’s created_at time is in the AM or PM, respectively:

p12_df %>%
  mutate(
    # Returns `TRUE` if the hour is 0#, 10, or 11, `FALSE` otherwise
    is_am = str_detect(string = created_at, pattern = " 0\\d| 1[01]"),
    # Recall we can set the `negate` argument to switch the returned `TRUE`/`FALSE`
    is_pm = str_detect(string = created_at, pattern = " 0\\d| 1[01]", negate = TRUE)
  ) %>% select(created_at, is_am, is_pm)
## # A tibble: 328 x 3
##    created_at          is_am is_pm
##    <dttm>              <lgl> <lgl>
##  1 2020-04-25 22:37:18 FALSE TRUE 
##  2 2020-04-23 21:11:49 FALSE TRUE 
##  3 2020-04-21 04:00:00 TRUE  FALSE
##  4 2020-04-24 03:00:00 TRUE  FALSE
##  5 2020-04-20 19:00:21 FALSE TRUE 
##  6 2020-04-20 02:20:01 TRUE  FALSE
##  7 2020-04-22 04:00:00 TRUE  FALSE
##  8 2020-04-25 17:00:00 FALSE TRUE 
##  9 2020-04-21 15:13:06 FALSE TRUE 
## 10 2020-04-21 17:52:47 FALSE TRUE 
## # … with 318 more rows


Because TRUE evaluates to 1 and FALSE evaluates to 0 in a numerical context, we could also sum the returned logical vector to see how many of the elements in the vector had a match:

# Number of tweets that were created in the AM
num_am_tweets <- sum(str_detect(string = p12_df$created_at, pattern = " 0\\d| 1[01]"))
num_am_tweets
## [1] 53


Additionally, we can take the average of the logical vector to get the proportion of elements in the input vector that had a match:

# Proportion of tweets that were created in the AM
pct_am_tweets <- mean(str_detect(string = p12_df$created_at, pattern = " 0\\d| 1[01]"))
pct_am_tweets
## [1] 0.1615854


We can also use the logical vector returned from str_detect() to filter p12_df to only include rows that had a match:

# Keep only rows whose tweet was created in the AM
p12_df %>%
  filter(str_detect(string = created_at, pattern = " 0\\d| 1[01]"))
## # A tibble: 53 x 5
##    user_id  created_at          screen_name text                location   
##    <chr>    <dttm>              <chr>       <chr>               <chr>      
##  1 22080148 2020-04-21 04:00:00 WSUPullman  "Darien McLaughlin… Pullman, W…
##  2 22080148 2020-04-24 03:00:00 WSUPullman  6 houses, one pick… Pullman, W…
##  3 22080148 2020-04-20 02:20:01 WSUPullman  Tell us one of you… Pullman, W…
##  4 22080148 2020-04-22 04:00:00 WSUPullman  We loved seeing yo… Pullman, W…
##  5 22080148 2020-04-24 01:58:04 WSUPullman  #WSU agricultural … Pullman, W…
##  6 22080148 2020-04-22 02:22:03 WSUPullman  Nice 👍 https://t.c… Pullman, W…
##  7 15988549 2020-04-20 02:52:31 CalAdmissi… @PaulineARoxas Con… Berkeley, …
##  8 15988549 2020-04-22 03:07:00 CalAdmissi… It’s time to make … Berkeley, …
##  9 15988549 2020-04-22 00:00:08 CalAdmissi… "Are you a #Berkel… Berkeley, …
## 10 15988549 2020-04-20 03:03:21 CalAdmissi… "@N48260756 We sug… Berkeley, …
## # … with 43 more rows
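To see why the AM pattern works, it helps to run it on a few hand-written timestamps. This sketch assumes the timestamps are character strings in the "YYYY-MM-DD HH:MM:SS" layout shown in the output above; the vector `ts` is made up for illustration:

```r
library(stringr)

# Hypothetical timestamps, including the two boundary cases around noon
ts <- c("2020-04-25 22:37:18", "2020-04-21 04:00:00",
        "2020-04-22 11:59:59", "2020-04-22 12:00:00")

# The only space in each timestamp comes right before the hour, so
# " 0\\d" matches hours 00-09 and " 1[01]" matches hours 10-11
is_am <- str_detect(string = ts, pattern = " 0\\d| 1[01]")
is_am  # FALSE TRUE TRUE FALSE

n_am <- sum(is_am)      # number of AM timestamps: 2
prop_am <- mean(is_am)  # proportion of AM timestamps: 0.5
```

Without the leading space, "0\\d" would also match parts of the date or the minutes (such as the "04" in "2020-04-25"), so the space is what ties the pattern to the hour field.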

5.2 str_subset()


The str_subset() function:

?str_subset

# SYNTAX AND DEFAULT VALUES
str_subset(string, pattern, negate = FALSE)
  • Function: Keeps strings that match a pattern
    • Returns input vector filtered to only keep elements that match the specified pattern
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • negate: If set to TRUE, the returned vector will contain only elements that did not match the specified pattern

Example: Using str_subset() on character vector

# Subsets the input vector to only keep elements that contain a digit
str_subset(string = c("One", "25th", "3000"), pattern = "\\d")
## [1] "25th" "3000"
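One way to think about str_subset() is as a shortcut for indexing the input vector with the logical vector from str_detect(). A minimal sketch (variable names are illustrative):

```r
library(stringr)

x <- c("One", "25th", "3000")

# Keep the elements that contain a digit
kept <- str_subset(string = x, pattern = "\\d")

# The same result written with str_detect() and bracket indexing
kept_manual <- x[str_detect(string = x, pattern = "\\d")]
identical(kept, kept_manual)  # TRUE

# negate = TRUE keeps the elements that do NOT match
dropped <- str_subset(string = x, pattern = "\\d", negate = TRUE)
dropped  # "One"
```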

Example: Using str_subset() on dataframe column

# Subsets the `created_at` vector of `p12_df` to only keep elements that occurred in the AM
str_subset(string = p12_df$created_at, pattern = " 0\\d| 1[01]")
##  [1] "2020-04-21 04:00:00" "2020-04-24 03:00:00" "2020-04-20 02:20:01"
##  [4] "2020-04-22 04:00:00" "2020-04-24 01:58:04" "2020-04-22 02:22:03"
##  [7] "2020-04-20 02:52:31" "2020-04-22 03:07:00" "2020-04-22 00:00:08"
## [10] "2020-04-20 03:03:21" "2020-04-22 00:47:00" "2020-04-23 06:34:00"
## [13] "2020-04-23 04:06:49" "2020-04-19 03:32:21" "2020-04-20 02:53:38"
## [16] "2020-04-20 02:53:14" "2020-04-20 03:04:11" "2020-04-19 03:30:14"
## [19] "2020-04-20 02:58:55" "2020-04-19 05:37:00" "2020-04-21 02:34:00"
## [22] "2020-04-20 00:15:07" "2020-04-25 04:18:29" "2020-04-25 00:00:01"
## [25] "2020-04-21 02:33:00" "2020-04-24 01:00:01" "2020-04-23 02:38:46"
## [28] "2020-04-24 04:48:28" "2020-04-24 01:06:33" "2020-04-25 04:48:08"
## [31] "2020-04-22 00:10:43" "2020-04-21 05:58:12" "2020-04-24 01:41:19"
## [34] "2020-04-24 01:42:44" "2020-04-24 01:43:11" "2020-04-23 02:45:24"
## [37] "2020-04-20 00:44:42" "2020-04-24 01:41:13" "2020-04-25 00:26:02"
## [40] "2020-04-25 00:31:23" "2020-04-25 00:46:40" "2020-04-25 00:20:36"
## [43] "2020-04-20 00:09:58" "2020-04-20 00:09:46" "2020-04-20 00:10:08"
## [46] "2020-04-25 00:29:12" "2020-04-22 01:45:02" "2020-04-23 02:00:14"
## [49] "2020-04-25 00:34:47" "2020-04-24 02:11:51" "2020-04-25 00:05:59"
## [52] "2020-04-21 04:14:11" "2020-04-23 02:13:21"

5.3 str_extract() & str_extract_all()


The str_extract() & str_extract_all() functions:

?str_extract
?str_extract_all

# SYNTAX AND DEFAULT VALUES
str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
  • Function: Extracts matching patterns from a string
    • Returns first match (str_extract()) or all matches (str_extract_all()) for input vector
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • simplify: If set to TRUE, the returned matches will be in a character matrix rather than the default list of character vectors

Example: Using str_extract() & str_extract_all() on character vector

[str_extract()] Extract the first occurrence of a word for each string:

# Extracts first match of a word
str_extract(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"),
            pattern = "\\w+")
## [1] "Three" "Two"   "A"

[str_extract_all()] Extract all occurrences of a word for each string:

# Extracts all matches of a word, returning a list of character vectors
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+")
## [[1]]
## [1] "Three"  "French" "hens"  
## 
## [[2]]
## [1] "Two"    "turtle" "doves" 
## 
## [[3]]
## [1] "A"         "partridge" "in"        "a"         "pear"      "tree"
# Extracts all matches of a word, returning a character matrix
str_extract_all(string = c("Three French hens", "Two turtle doves", "A partridge in a pear tree"), 
                pattern = "\\w+", simplify = TRUE)
##      [,1]    [,2]        [,3]    [,4] [,5]   [,6]  
## [1,] "Three" "French"    "hens"  ""   ""     ""    
## [2,] "Two"   "turtle"    "doves" ""   ""     ""    
## [3,] "A"     "partridge" "in"    "a"  "pear" "tree"

Example: Using str_extract() & str_extract_all() on dataframe column

[str_extract()] Extract first hashtag:

# Extracts first match of a hashtag (if there is one)
p12_df %>% 
  mutate(
    hashtag = str_extract(string = text, pattern = "#\\S+")
  ) %>% select(text, hashtag)
## # A tibble: 328 x 2
##    text                                                            hashtag 
##    <chr>                                                           <chr>   
##  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDraft2020 | @dad… #GoCougs
##  2 Cougar Cheese. That's it. That's the tweet. 🧀#WSU #GoCougs htt… #WSU    
##  3 "Darien McLaughlin '19, and her dog, Yuki, went on a #Pullman … #Pullman
##  4 6 houses, one pick. Cougs, which one you got? Reply ⬇️  #WSU #… #WSU    
##  5 Why did you choose to attend @WSUPullman?🤔 #WSU #GoCougs https… #WSU    
##  6 Tell us one of your Bryan Clock Tower memories ⏰ 🐾 #WSU #GoCou… #WSU    
##  7 We loved seeing your top three @WSUPullman buildings, but what… #WSU    
##  8 "Congratulations, graduates! We’re two weeks away from the #WS… #WSU    
##  9 Learn more about this story at https://t.co/45BzKc2rFE. #WSU #… #WSU    
## 10 "Tomorrow, our @WSUEsports Team is facing off against \n@Espor… #GoCoug…
## # … with 318 more rows

[str_extract_all()] Extract all hashtags:

# Extracts all matches of hashtags (if there are any)
p12_df %>% 
  mutate(
    hashtag_vector = str_extract_all(string = text, pattern = "#\\S+"),
    # Use `as.character()` so we can see the content of the character vector of matches
    hashtags = as.character(hashtag_vector)
  ) %>% select(text, hashtag_vector, hashtags)
## # A tibble: 328 x 3
##    text                             hashtag_vector hashtags                
##    <chr>                            <list>         <chr>                   
##  1 "Big Dez is headed to Indy!\n\n… <chr [3]>      "c(\"#GoCougs\", \"#NFL…
##  2 Cougar Cheese. That's it. That'… <chr [2]>      "c(\"#WSU\", \"#GoCougs…
##  3 "Darien McLaughlin '19, and her… <chr [3]>      "c(\"#Pullman\", \"#Cou…
##  4 6 houses, one pick. Cougs, whic… <chr [3]>      "c(\"#WSU\", \"#CougsCo…
##  5 Why did you choose to attend @W… <chr [2]>      "c(\"#WSU\", \"#GoCougs…
##  6 Tell us one of your Bryan Clock… <chr [2]>      "c(\"#WSU\", \"#GoCougs…
##  7 We loved seeing your top three … <chr [2]>      "c(\"#WSU\", \"#GoCougs…
##  8 "Congratulations, graduates! We… <chr [3]>      "c(\"#WSU\", \"#CougGra…
##  9 Learn more about this story at … <chr [2]>      "c(\"#WSU\", \"#GoCougs…
## 10 "Tomorrow, our @WSUEsports Team… <chr [1]>      #GoCougs!               
## # … with 318 more rows
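
The list returned by str_extract_all() can also be summarized without as.character(). The sketch below uses a small made-up vector of tweets (the `tweets` object is hypothetical, not part of p12_df) and shows one possible way to count the matches per string and collapse them into a single readable string:

```r
library(stringr)

# Hypothetical tweets, made up for illustration
tweets <- c("Go team! #WSU #GoCougs", "No hashtags here", "#Pullman is sunny")

# List of character vectors, one element per tweet
hashtag_list <- str_extract_all(string = tweets, pattern = "#\\S+")

# Number of hashtags per tweet (length of each list element)
num_hashtags <- lengths(hashtag_list)

# Collapse each character vector into one space-separated string
hashtags_collapsed <- sapply(hashtag_list, paste, collapse = " ")

num_hashtags
hashtags_collapsed
```

A tweet with no hashtags yields an empty character vector, so its count is 0 and its collapsed string is empty.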

5.4 str_match() & str_match_all()


The str_match() & str_match_all() functions:

?str_match
?str_match_all

# SYNTAX
str_match(string, pattern)
str_match_all(string, pattern)
  • Function: Extracts matched groups from a string
    • Returns a character matrix containing the full match in the first column, then additional columns for matches from each capturing group (str_match()), or a list of such matrices with one matrix per input string (str_match_all())
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_match() & str_match_all() on character vector

[str_match()] Extract the first month, day, year for each string:

# Extracts first match of month, day, year
str_match(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
          pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
##      [,1]       [,2] [,3] [,4]  
## [1,] "5-1-2020" "5"  "1"  "2020"
## [2,] "12/25/17" "12" "25" "17"  
## [3,] "01.01.13" "01" "01" "13"

[str_match_all()] Extract all month, day, year for each string:

# Extracts all matches of month, day, year
str_match_all(string = c("5-1-2020", "12/25/17", "01.01.13 to 01.01.14"),
              pattern = "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)")
## [[1]]
##      [,1]       [,2] [,3] [,4]  
## [1,] "5-1-2020" "5"  "1"  "2020"
## 
## [[2]]
##      [,1]       [,2] [,3] [,4]
## [1,] "12/25/17" "12" "25" "17"
## 
## [[3]]
##      [,1]       [,2] [,3] [,4]
## [1,] "01.01.13" "01" "01" "13"
## [2,] "01.01.14" "01" "01" "14"
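
It can help to know what these functions return when a string does not match at all. The sketch below reuses the date pattern from above on a made-up input (the `dates` vector is hypothetical): str_match() fills the row with NA, while str_match_all() should return a matrix with zero rows for that string:

```r
library(stringr)

# Hypothetical input: the first string contains no date
dates <- c("no date here", "5-1-2020")
date_regex <- "(\\d+)[-/\\.](\\d+)[-/\\.](\\d+)"

# One row per input string; non-matching strings give a row of NA
first_match <- str_match(string = dates, pattern = date_regex)

# One matrix per input string; non-matching strings give a matrix with no rows
all_matches <- str_match_all(string = dates, pattern = date_regex)

first_match
all_matches
```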

Example: Using str_match() on dataframe column

Below, we extract datetime from the created_at column. The first capturing group matches the date part and the second capturing group matches the time part:

datetime_regex <- "([\\d-]+) ([\\d:]+)"
p12_df %>%
  mutate(
    # The 1st capturing group will be in the 2nd column of the matrix returned from `str_match()`
    # So we use [, 2] below and save the result to the `date` column of the dataframe
    date = str_match(string = created_at, pattern = datetime_regex)[, 2],
    # The 2nd capturing group will be in the 3rd column of the matrix returned from `str_match()`
    # So we use [, 3] below and save the result to the `time` column of the dataframe
    time = str_match(string = created_at, pattern = datetime_regex)[, 3]
  ) %>% select(created_at, date, time)
## # A tibble: 328 x 3
##    created_at          date       time    
##    <dttm>              <chr>      <chr>   
##  1 2020-04-25 22:37:18 2020-04-25 22:37:18
##  2 2020-04-23 21:11:49 2020-04-23 21:11:49
##  3 2020-04-21 04:00:00 2020-04-21 04:00:00
##  4 2020-04-24 03:00:00 2020-04-24 03:00:00
##  5 2020-04-20 19:00:21 2020-04-20 19:00:21
##  6 2020-04-20 02:20:01 2020-04-20 02:20:01
##  7 2020-04-22 04:00:00 2020-04-22 04:00:00
##  8 2020-04-25 17:00:00 2020-04-25 17:00:00
##  9 2020-04-21 15:13:06 2020-04-21 15:13:06
## 10 2020-04-21 17:52:47 2020-04-21 17:52:47
## # … with 318 more rows
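
The example above calls str_match() twice with the same arguments. One alternative, sketched here on a small made-up vector (the `created_at` object below is hypothetical and only mimics the format of the dataframe column), is to run the match once and index the resulting matrix:

```r
library(stringr)

# Hypothetical datetime strings in the same format as the `created_at` column
created_at <- c("2020-04-25 22:37:18", "2020-04-23 21:11:49")
datetime_regex <- "([\\d-]+) ([\\d:]+)"

# Run the match once and keep the whole matrix
m <- str_match(string = created_at, pattern = datetime_regex)

date <- m[, 2]  # 1st capturing group is in the 2nd column
time <- m[, 3]  # 2nd capturing group is in the 3rd column

date
time
```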

5.5 str_replace() & str_replace_all()


The str_replace() & str_replace_all() functions:

?str_replace
?str_replace_all

# SYNTAX
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
  • Function: Replaces matched patterns in a string
    • Returns input vector with first match (str_replace()) or all matches (str_replace_all()) for each string replaced with specified replacement
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for
    • replacement: What the matched pattern should be replaced with
  • str_replace_all() also supports multiple replacements, where you can omit the replacement argument and just provide a named vector of replacements as the pattern

Example: Using str_replace() & str_replace_all()

[str_replace()] Replace the first occurrence of a vowel:

# Replace first vowel with empty string
str_replace(string = "Thanks for the Memories", pattern = "[aeiou]", replacement = "")
## [1] "Thnks for the Memories"

[str_replace_all()] Replace all occurrences of a vowel:

# Replace all vowels with empty strings
str_replace_all(string = "Thanks for the Memories", pattern = "[aeiou]", replacement = "")
## [1] "Thnks fr th Mmrs"

Example: Using backreferences with str_replace() & str_replace_all()

[str_replace()] Reorders the first date that is matched:

# Use \\1, \\2, and \\3 to refer to the capturing groups (i.e., month, day, year)
str_replace(string = "12/31/19 to 01/01/20", pattern = "(\\d+)/(\\d+)/(\\d+)",
            replacement = "20\\3-\\1-\\2")
## [1] "2019-12-31 to 01/01/20"

[str_replace_all()] Reorders all dates that are matched:

# Use \\1, \\2, and \\3 to refer to the capturing groups (i.e., month, day, year)
str_replace_all(string = "12/31/19 to 01/01/20", pattern = "(\\d+)/(\\d+)/(\\d+)",
                replacement = "20\\3-\\1-\\2")
## [1] "2019-12-31 to 2020-01-01"

Example: Using str_replace_all() for multiple replacements

# Replace all occurrences of "at" with "@", and all digits with "#"
str_replace_all(string = "Tomorrow at 10:30AM", pattern = c("at" = "@", "\\d" = "#"))
## [1] "Tomorrow @ ##:##AM"
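
One thing to watch for with multiple replacements: to our understanding, the replacements in the named vector are applied one after another rather than simultaneously, so a later pattern can match text produced by an earlier replacement. A minimal sketch with a made-up string:

```r
library(stringr)

# "a" is first replaced with "b", then every "b" (including the new one) is replaced with "c"
sequential <- str_replace_all(string = "abc", pattern = c("a" = "b", "b" = "c"))

sequential
```

If the patterns and replacements do not overlap, as in the "at"/digits example above, the order does not matter.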

Example: Using str_replace_all() on dataframe column

p12_df %>%
  mutate(
    # Replace all hashtags and handles from tweet with an empty string
    removed_hashtags_handles = str_replace_all(string = text, pattern = "[@#]\\S+", replacement = "")
  ) %>% select(text, removed_hashtags_handles)
## # A tibble: 328 x 2
##    text                              removed_hashtags_handles              
##    <chr>                             <chr>                                 
##  1 "Big Dez is headed to Indy!\n\n#… "Big Dez is headed to Indy!\n\n |  | …
##  2 Cougar Cheese. That's it. That's… Cougar Cheese. That's it. That's the …
##  3 "Darien McLaughlin '19, and her … "Darien McLaughlin '19, and her dog, …
##  4 6 houses, one pick. Cougs, which… 6 houses, one pick. Cougs, which one …
##  5 Why did you choose to attend @WS… Why did you choose to attend    https…
##  6 Tell us one of your Bryan Clock … Tell us one of your Bryan Clock Tower…
##  7 We loved seeing your top three @… We loved seeing your top three  build…
##  8 "Congratulations, graduates! We’… "Congratulations, graduates! We’re tw…
##  9 Learn more about this story at h… "Learn more about this story at https…
## 10 "Tomorrow, our @WSUEsports Team … "Tomorrow, our  Team is facing off ag…
## # … with 318 more rows

5.6 str_split()


The str_split() function:

?str_split

# SYNTAX AND DEFAULT VALUES
str_split(string, pattern, n = Inf, simplify = FALSE)
  • Function: Splits a string by specified pattern
    • Returns a list of character vectors containing the split substrings (or a character matrix if simplify = TRUE)
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for and split by
    • n: Maximum number of substrings to return
    • simplify: If set to TRUE, the returned matches will be in a character matrix rather than the default list of character vectors

Example: Using str_split() on character vector

# Split by comma or the word "and"
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ")
## [[1]]
## [1] "The Lion"     "the Witch"    "the Wardrobe"
## 
## [[2]]
## [1] "Peanut butter" "jelly"


We can specify n to control the maximum number of substrings we want to return:

# Limit split to only return 2 substrings
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ", n = 2)
## [[1]]
## [1] "The Lion"                    "the Witch, and the Wardrobe"
## 
## [[2]]
## [1] "Peanut butter" "jelly"


We can specify simplify = TRUE to return a character matrix instead of a list:

# Return split substrings in a character matrix
str_split(string = c("The Lion, the Witch, and the Wardrobe", "Peanut butter and jelly"),
          pattern = ",? and |, ", simplify = TRUE)
##      [,1]            [,2]        [,3]          
## [1,] "The Lion"      "the Witch" "the Wardrobe"
## [2,] "Peanut butter" "jelly"     ""

Example: Using str_split() on dataframe column

When we split the created_at field at either a hyphen or a space, we can separate out the year, month, day, and time components of the string:

p12_df %>%
  mutate(
    # Use `as.character()` so we can see the content of the character vector of split strings
    year_month_day_time = as.character(str_split(string = created_at, pattern = "[- ]"))
  ) %>% select(created_at, year_month_day_time)
## # A tibble: 328 x 2
##    created_at          year_month_day_time                        
##    <dttm>              <chr>                                      
##  1 2020-04-25 22:37:18 "c(\"2020\", \"04\", \"25\", \"22:37:18\")"
##  2 2020-04-23 21:11:49 "c(\"2020\", \"04\", \"23\", \"21:11:49\")"
##  3 2020-04-21 04:00:00 "c(\"2020\", \"04\", \"21\", \"04:00:00\")"
##  4 2020-04-24 03:00:00 "c(\"2020\", \"04\", \"24\", \"03:00:00\")"
##  5 2020-04-20 19:00:21 "c(\"2020\", \"04\", \"20\", \"19:00:21\")"
##  6 2020-04-20 02:20:01 "c(\"2020\", \"04\", \"20\", \"02:20:01\")"
##  7 2020-04-22 04:00:00 "c(\"2020\", \"04\", \"22\", \"04:00:00\")"
##  8 2020-04-25 17:00:00 "c(\"2020\", \"04\", \"25\", \"17:00:00\")"
##  9 2020-04-21 15:13:06 "c(\"2020\", \"04\", \"21\", \"15:13:06\")"
## 10 2020-04-21 17:52:47 "c(\"2020\", \"04\", \"21\", \"17:52:47\")"
## # … with 318 more rows
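
If each component is needed as its own vector (or column), simplify = TRUE may be more convenient than a list. The sketch below uses a small made-up vector (the `created_at` object is hypothetical and only mimics the format of the dataframe column):

```r
library(stringr)

# Hypothetical datetime strings in the same format as the `created_at` column
created_at <- c("2020-04-25 22:37:18", "2020-04-23 21:11:49")

# Character matrix with one row per string and one column per component
parts <- str_split(string = created_at, pattern = "[- ]", simplify = TRUE)

year <- parts[, 1]
month <- parts[, 2]
day <- parts[, 3]
time <- parts[, 4]

parts
```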

5.7 str_count()


The str_count() function:

?str_count

# SYNTAX AND DEFAULT VALUES
str_count(string, pattern = "")
  • Function: Counts the number of matches in a string
    • Returns the number of matches
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_count() on character vector

# Counts the number of digits
str_count(string = c("H2O2", "Year 3000", "4th of July"), pattern = "\\d")
## [1] 2 4 1

Example: Using str_count() on dataframe column

p12_df %>%
  mutate(
    # Counts the total number of hashtags and mentions
    num_hashtags_and_mentions = str_count(string = text, pattern = "[@#]\\S+")
  ) %>% select(text, num_hashtags_and_mentions)
## # A tibble: 328 x 2
##    text                                               num_hashtags_and_men…
##    <chr>                                                              <int>
##  1 "Big Dez is headed to Indy!\n\n#GoCougs | #NFLDra…                     5
##  2 Cougar Cheese. That's it. That's the tweet. 🧀#WSU…                     2
##  3 "Darien McLaughlin '19, and her dog, Yuki, went o…                     4
##  4 6 houses, one pick. Cougs, which one you got? Rep…                     3
##  5 Why did you choose to attend @WSUPullman?🤔 #WSU #…                     3
##  6 Tell us one of your Bryan Clock Tower memories ⏰ …                     2
##  7 We loved seeing your top three @WSUPullman buildi…                     3
##  8 "Congratulations, graduates! We’re two weeks away…                     3
##  9 Learn more about this story at https://t.co/45BzK…                     2
## 10 "Tomorrow, our @WSUEsports Team is facing off aga…                     5
## # … with 318 more rows

5.8 str_locate() & str_locate_all()


The str_locate() & str_locate_all() functions:

?str_locate
?str_locate_all

# SYNTAX
str_locate(string, pattern)
str_locate_all(string, pattern)
  • Function: Locates the position of patterns in a string
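    • Returns an integer matrix containing the start position of the match in the first column and the end position of the match in the second column (str_locate()), or a list of such matrices with one matrix per input string (str_locate_all())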
  • Arguments:
    • string: Character vector (or vector coercible to character) to search
    • pattern: Pattern to look for

Example: Using str_locate() & str_locate_all() on character vector

[str_locate()] Locate the start and end positions for first stretch of numbers:

# Locate positions for first stretch of numbers
str_locate(string = c("555.123.4567", "(555) 135-7900 and (555) 246-8000"),
           pattern = "\\d+")
##      start end
## [1,]     1   3
## [2,]     2   4

[str_locate_all()] Locate the start and end positions for all stretches of numbers:

# Locate positions for all stretches of numbers
str_locate_all(string = c("555.123.4567", "(555) 135-7900 and (555) 246-8000"),
               pattern = "\\d+")
## [[1]]
##      start end
## [1,]     1   3
## [2,]     5   7
## [3,]     9  12
## 
## [[2]]
##      start end
## [1,]     2   4
## [2,]     7   9
## [3,]    11  14
## [4,]    21  23
## [5,]    26  28
## [6,]    30  33
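
The positions returned by str_locate() can be passed to str_sub() to pull out the matched text. This short sketch reuses the phone number strings from above and checks that the result agrees with str_extract():

```r
library(stringr)

phones <- c("555.123.4567", "(555) 135-7900 and (555) 246-8000")

# Start and end positions of the first stretch of digits in each string
loc <- str_locate(string = phones, pattern = "\\d+")

# Extract the substring between those positions (both ends inclusive)
first_digits <- str_sub(string = phones, start = loc[, "start"], end = loc[, "end"])

first_digits
```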

Example: Using str_locate() on dataframe column

p12_df %>%
  mutate(
    # Start position of first hashtag in tweet (i.e., 1st column of matrix returned from `str_locate()`)
    start_of_first_hashtag = str_locate(string = text, pattern = "#\\S+")[, 1],
    # End position of first hashtag in tweet (i.e., 2nd column of matrix returned from `str_locate()`)
    end_of_first_hashtag = str_locate(string = text, pattern = "#\\S+")[, 2],
    # Length of first hashtag in tweet (start and end positions are both inclusive, so add 1)
    length_of_first_hashtag = end_of_first_hashtag - start_of_first_hashtag + 1L
  ) %>% select(text, start_of_first_hashtag, end_of_first_hashtag, length_of_first_hashtag)
## # A tibble: 328 x 4
##    text               start_of_first_ha… end_of_first_ha… length_of_first_…
##    <chr>                           <int>            <int>             <int>
##  1 "Big Dez is heade…                 29               36                 8
##  2 Cougar Cheese. Th…                 46               49                 4
##  3 "Darien McLaughli…                 53               60                 8
##  4 6 houses, one pic…                 57               60                 4
##  5 Why did you choos…                 44               47                 4
##  6 Tell us one of yo…                 52               55                 4
##  7 We loved seeing y…                144              147                 4
##  8 "Congratulations,…                 59               62                 4
##  9 Learn more about …                 57               60                 4
## 10 "Tomorrow, our @W…                266              274                 9
## # … with 318 more rows

6 Appendix

6.1 RegExplain Addin

Regular expressions are tricky. RegExplain makes it easier to see what you’re doing.

Credit: Garrick Aden-Buie (RegExplain)


RegExplain is an RStudio addin that allows the user to check their regex matching functions interactively.

# Installation
devtools::install_github("gadenbuie/regexplain")
library(regexplain)

6.2 HTML

Markup Language

“A markup language is a computer language that uses tags to define elements within a document. It is human-readable, meaning markup files contain standard words, rather than typical programming syntax.”

Credit: Markup Language from TechTerms


Hypertext Markup Language (HTML)

  • HTML is a markup language for the creation of websites
    • HTML puts the content on the webpage, but does not “style” the page (e.g., fonts, colors, background)
    • CSS (Cascading Style Sheets) adds style to the webpage (e.g., fonts, colors, etc.)
    • JavaScript adds functionality to the webpage

6.2.1 HTML Basics

Intro to HTML (and CSS)


A Simple HTML Document (From w3schools)

  • HTML consists of a series of elements
    • Elements are defined by a start tag, some content, and an end tag:
      • <tagname> Content </tagname>
    • Elements can be nested within one another
  • Components of a basic HTML document:
    • Begin with <!DOCTYPE html> to indicate it is an HTML document
    • The <html> element is the root element of an HTML page, where all other elements are nested
    • The <head> element contains meta information about the document (i.e., not displayed on webpage)
      • Including CSS style to apply to HTML content
    • The <body> element contains the visible page content
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>


6.2.2 Tags

What are HTML tags?

  • HTML tags are element names surrounded by angle brackets
  • Tags usually come in pairs (e.g. <p> and </p>)
    • The first tag is the start tag and the second tag is the end tag
  • But some tags are self-closing (e.g., <img />)

Credit: HTML introduction from W3schools


Some common HTML tags (not exhaustive):

Tag            Description
<h1> - <h6>    Heading
<p>            Paragraph
<a>            Link
<img>          Image
<div>          Division (can think of it as a container to group other elements)
<strong>       Bold
<em>           Italics
<ul>           Unordered list (consists of <li> elements)
<ol>           Ordered list (consists of <li> elements)
  <li>         List item
<table>        Table (consists of <tr>, <td>, & <th> elements)
  <tr>         Table row
  <td>         Table data/cell
  <th>         Table header


6.2.3 Attributes

What are attributes?

  • Attributes in HTML elements are optional, but all HTML elements can have attributes
  • Attributes are used to specify additional characteristics of elements
  • Attributes are always specified in the start tag
  • Attributes usually come in name/value pairs like: name="value"

Credit: HTML attributes from W3schools


Some common attributes you may encounter:

  • The href attribute for an <a> tag (specifies url to link to):

    <a href="https://www.w3schools.com">This is a link</a>
  • The src attribute for an <img> tag (specifies image to display):

    <img src="html_cheatsheet.jpg" />
  • You can add more than one attribute to an element:

    <img src="html_cheatsheet.jpg" width="200" height="300" />
  • The class and id attributes are also commonly added to elements to be able to identify and select for them


6.2.3.1 class

  • The class attribute can specify one or more class names for an HTML element
  • An element can be identified by its class
  • You can select for an element by its class using . followed by the class name (see GeeksforGeeks for more)
    • For example, this can be used in CSS to select for and style all elements with a specific class

HTML:

<div class="countries">
  <h3>United States</h3>
  <p class="place">Washington D.C.</p>
  <img src="https://cdn.aarp.net/content/dam/aarp/travel/destination-guides/2018/03/1140-trv-dst-dc-main.imgcache.revd66f01d4a19adcecdb09fdacd4471fa8.jpg">
</div>
    
<div class="countries">
  <h3>Mexico</h3>
  <p class="place">Guadalajara</p>
  <img src="https://cityofguadalajara.com/wp-content/uploads/2016/11/Centro-Historico-de-Guadalajara-800x288.jpg">
</div>

CSS:

<style>   
.countries {
  background-color: #e6e6e6;
  color: #336699;
  margin: 10px;
  padding: 15px;
}

.place {
  color: black;
}
</style>

Result:

United States

Washington D.C.

Mexico

Guadalajara

Credit: HTML Classes from W3schools

6.2.3.2 id

  • The id attribute is used to specify one unique HTML element within the HTML document
  • An element can be identified by its id
  • You can select for an element by its id using # followed by the id name (see GeeksforGeeks for more)
    • For example, this can be used in CSS to select for and style a specific element with a certain id

HTML:

<div id="banner">My Banner</div>

CSS:

<style>
#banner {
  background-color: #e6e6e6;
  font-size: 40px;
  padding: 20px;
  text-align: center;
}
</style>

Result:

My Banner

Credit: HTML Id from W3schools


6.2.4 Student Exercise

  • Spend 5-10 minutes playing with the simple HTML text below; experiment with whichever additional elements/tags/attributes/etc you want
  • Paste the below code into TryIt Editor and click Run
<!DOCTYPE html>
<html>
<head>
  <title>Page title (in head tag)</title>
</head>
<body>

  <h1>Title of level 1 heading</h1>
  
  <p>My first paragraph.</p>
  <p>My second paragraph.</p>
  <p>Add some bold text <strong>right here</strong></p>
  <p>Add some italics text <em>right here</em></p>
  

  <p>Include a hyperlink tag within a paragraph tag. this book looks interesting : <a href="https://bookdown.org/rdpeng/rprogdatascience/">R Programming for Data Science</a></p>  
  
  <p>Include another hyperlink tag within a paragraph tag. chapter on <a href="https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html">Regular Expressions</a></p>    
  <p> put a button inside this paragraph <button>I am a button!</button></p>
  
  <p>Here are some items in a list, but items not placed within an unordered list </p>
  
  <li> text you want in item</li>
  <li> text you want in another item</li>
  
  <p>Here are some items in an unordered list</p>
  
  <ul>
  <li> first item in unordered list </li>
  <li> second item in unordered list </li>
  </ul>

</body>
  
</html>


6.2.5 HTML Resources

Lots of wonderful resources on the web to learn HTML!

6.3 Webscraping using rvest

The rvest package

rvest helps you scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup.

Credit: rvest webpage

[rvest package contains] Wrappers around the xml2 and httr packages to make it easy to download, then manipulate, HTML and XML.

Credit: rvest package documentation


library(rvest)


Why use the rvest package?

  • rvest makes it easy to parse HTML
  • First, we use the read_html() function to read in the HTML and convert it to an xml_document/xml_node object
  • A node is just an HTML element
  • HTML is made up of nested elements, so once we’ve read in the HTML to an xml_node object, we can easily traverse the nested nodes (i.e., children elements) and parse the HTML
  • rvest comes with many helpful functions to search and extract various parts of the HTML
    • html_node()/html_nodes(): Search and extract node(s) (i.e., HTML elements)
    • html_text(): Extract the content between HTML tags
    • html_attr()/html_attrs(): Extract the attribute(s) of HTML tags
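
As a quick illustration of the extraction functions listed above, the sketch below parses a one-line HTML string (adapted from the href example earlier; the surrounding <p> tag is added here for illustration) and pulls out the text and the href attribute of the link:

```r
library(rvest)

# Small HTML string adapted from the attribute examples earlier in this lecture
html <- read_html('<p>Visit <a href="https://www.w3schools.com">This is a link</a></p>')

# Select the first <a> element
link <- html %>% html_node('a')

# Content between the <a> and </a> tags
link_text <- html_text(link)

# Value of the href attribute
link_href <- html_attr(link, 'href')

link_text
link_href
```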

6.3.1 Reading HTML

The read_html() function:

?read_html

# SYNTAX AND DEFAULT VALUES
read_html(x, encoding = "", ..., options = c("RECOVER", "NOERROR", "NOBLANKS"))
  • Arguments:
    • x: The input can be a string containing HTML or url to the webpage you want to scrape
  • Output:
    • The HTML that is read in will be returned as an rvest xml_document/xml_node object and can be easily parsed
    • You can also view the raw HTML using as.character()

Scraping HTML from a webpage:

  • Navigate to the webpage (e.g., https://corona.help/) in your browser
    • If possible, use Google Chrome or Mozilla Firefox
  • View the HTML of the page by right clicking > View Page Source
    • This will be the raw HTML that is scraped when we use read_html()
  • When you right click, you may notice another option called Inspect (Chrome) or Inspect Element (Firefox) that will pop up a side panel
    • This can be helpful for visualizing the HTML elements on the page
    • You can also click on this side panel and hit ctrl + f (Windows) or cmd + f (Macs) to search for elements using a selector
    • But note that the HTML you see here might not be the same as what you see in View Page Source (i.e., what is scraped), since it also reflects changes made to the HTML after the page was loaded (e.g., by JavaScript)

Example: Using read_html() to read in HTML from string

html <- read_html("<h1>This is a heading.</h1><p>This is a paragraph.</p>")

# View object
html
## {html_document}
## <html>
## [1] <body>\n<h1>This is a heading.</h1>\n<p>This is a paragraph.</p>\n</ ...
# View class of object
class(html)
## [1] "xml_document" "xml_node"
# View raw HTML
as.character(html)
## [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<h1>This is a heading.</h1>\n<p>This is a paragraph.</p>\n</body></html>\n"

Example: Using read_html() to scrape the page https://corona.help/

corona <- read_html("https://corona.help/")

# View object
corona
## {html_document}
## <html class="loading" lang="en" data-textdirection="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="horizontal-layout horizontal-menu dark-layout 2-columns ...
# View class of object
class(corona)
## [1] "xml_document" "xml_node"
# View raw HTML [output omitted]
as.character(corona)
# Inspect raw HTML
str(as.character(corona))
##  chr "<!DOCTYPE html>\n<html class=\"loading\" lang=\"en\" data-textdirection=\"ltr\">\n<!-- BEGIN: Head--><head>\n<m"| __truncated__

6.3.2 Parsing HTML

The html_node() & html_nodes() functions:

?html_node
?html_nodes

# SYNTAX
html_node(x, css, xpath)
html_nodes(x, css, xpath)
  • Arguments:
    • x: An rvest xml_document/xml_node object (use read_html() to get this)
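    • css: Selector (can select by HTML tag name, its attributes, etc.)
    • xpath: XPath expression, as an alternative to a CSS selector (supply either css or xpath, not both)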
  • Output:
    • html_node() returns the first element that it finds as an rvest xml_node object
      • Recall that a node is just an HTML element
    • html_nodes() returns all elements that it finds as an rvest xml_nodeset object
      • All elements that are selected will be returned in a nodeset
    • Again, you can view the raw HTML using as.character()
      • Syntax: as.character(html_node(...))

Selecting for HTML elements:

  • HTML elements can be selected in many ways
    • Selecting by tagname: 'p', 'table', etc.
    • Selecting by class using .: '.my-class'
    • Selecting by id using #: '#my-id'
    • Selecting nested elements: 'table tr' (selects all rows within a table)
  • You can test your selector in your browser
    • Right click and select Inspect (Chrome) or Inspect Element (Firefox) to bring up a side panel
    • Hit ctrl + f (Windows) or cmd + f (Macs) and enter your selector to search for elements

Example: Using html_node() & html_nodes() to parse HTML string

Remember that the input to html_node()/html_nodes() should be an rvest xml_document/xml_node object, which we can obtain from read_html():

html <- read_html("<p>Paragraph #1</p><p>Paragraph #2</p><p>Paragraph #3</p>")

# View class of object
class(html)
## [1] "xml_document" "xml_node"
# View raw HTML to see what elements are there
as.character(html)
## [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>Paragraph #1</p>\n<p>Paragraph #2</p>\n<p>Paragraph #3</p>\n</body></html>\n"


If we search for the <p> element using html_node(), it will return the first result:

first_p <- html_node(html, 'p')

# View class of object
class(first_p)
## [1] "xml_node"
# View raw HTML
as.character(first_p)
## [1] "<p>Paragraph #1</p>\n"


If we search for the <p> element using html_nodes(), it will return all results:

all_p <- html_nodes(html, 'p')

# View class of object
class(all_p)
## [1] "xml_nodeset"
# View raw HTML
as.character(all_p)
## [1] "<p>Paragraph #1</p>\n" "<p>Paragraph #2</p>\n" "<p>Paragraph #3</p>"


Note that we could also use %>%:

# These are equivalent to the above
first_p <- html %>% html_node('p')
all_p <- html %>% html_nodes('p')

Example: Using html_node() & html_nodes() to parse https://corona.help/

Let’s revisit the HTML we scraped from https://corona.help/ in the previous example:

  • We will try selecting for the “Total by country” table off of that page
  • In your browser, right click > View Page Source to check that the table element is indeed in the scraped HTML
  • Then, you can right click the table on the page and inspect it to better visualize the elements
# Scraped HTML is stored in this `xml_document`/`xml_node` object
class(corona)
## [1] "xml_document" "xml_node"


Select for the <table> element on that page using html_node():

# Since this table is the only table on the page, we can just use `html_node()`
corona_table <- corona %>% html_node('table')
corona_table
## {html_node}
## <table class="table table-striped table-hover-animation mb-0" id="table">
## [1] <thead id="thead"><tr>\n<th>COUNTRY</th>\n                           ...
## [2] <tbody>\n<tr>\n<td><a href="https://corona.help/country/united-state ...
# View class of object
class(corona_table)
## [1] "xml_node"
# View raw HTML of `corona_table` [output omitted]
as.character(corona_table)


Select all rows in the table (i.e., <tr> elements) using html_nodes():

  • It makes sense to select by row (rather than column) because each row usually represents an observation
  • The way HTML tables are structured also makes it easier to extract information by row because each <tr> element (i.e., row) has <th>/<td> elements (i.e., column cells) nested within it, and not the other way around
  • But if you wanted to select a certain column, there are ways to do that as well (e.g., table tr td:nth-child(1) selects the first cell in each row a.k.a. the first column in table)
# We can chain `html_node()`/`html_nodes()` functions
corona_rows <- corona %>% html_node('table') %>% html_nodes('tr')

# Alternatively, we can use `table tr` as the selector to select all `tr` elements within a `table`
corona_rows <- corona %>% html_nodes('table tr')

# Investigate object
head(corona_rows) # View first few rows
## {xml_nodeset (6)}
## [1] <tr>\n<th>COUNTRY</th>\n                          <th>INFECTED</th>\ ...
## [2] <tr>\n<td><a href="https://corona.help/country/united-states">\n     ...
## [3] <tr>\n<td><a href="https://corona.help/country/india">\n             ...
## [4] <tr>\n<td><a href="https://corona.help/country/brazil">\n            ...
## [5] <tr>\n<td><a href="https://corona.help/country/russia">\n            ...
## [6] <tr>\n<td><a href="https://corona.help/country/united-kingdom">\n    ...
typeof(corona_rows)
## [1] "list"
class(corona_rows)
## [1] "xml_nodeset"
length(corona_rows) # Number of elements
## [1] 227
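
The column selector mentioned in the bullets above can be tried on a small table. The following is a minimal sketch, assuming a made-up two-row table (toy_html and its contents are illustrative and are not the scraped corona.help page):

```r
library(rvest)

# Hypothetical miniature table standing in for the scraped page
toy_html <- read_html(paste0(
  '<table>',
  '<tr><th>COUNTRY</th><th>INFECTED</th></tr>',
  '<tr><td>Alpha</td><td>1,200</td></tr>',
  '<tr><td>Beta</td><td>35</td></tr>',
  '</table>'
))

# `td:nth-child(1)` keeps only <td> cells that are the first child of their row,
# i.e., the first column; the header row is skipped because it uses <th>
first_col <- toy_html %>% html_nodes('table tr td:nth-child(1)') %>% html_text()
first_col
## [1] "Alpha" "Beta"
```

Changing the index (e.g., td:nth-child(2)) would select the second column in the same way.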


6.3.3 Practicing regex

The following examples use the Coronavirus data from https://corona.help/:

  • Recall that we selected all rows in the data table on that page in the previous example
  • If we wanted to create a data frame out of this table, we could further select each cell in the table (i.e., <td> elements from each row)
  • For now, we will be practicing parsing data from each row using regex
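
For reference, the cell-by-cell approach mentioned above could look roughly like the sketch below. It runs on a made-up table rather than the scraped page, and toy_rows, cells, and toy_df are names introduced here for illustration:

```r
library(rvest)

# Hypothetical miniature table; select its rows the same way as `corona_rows`
toy_rows <- read_html(paste0(
  '<table>',
  '<tr><th>COUNTRY</th><th>INFECTED</th></tr>',
  '<tr><td>Alpha</td><td>1,200</td></tr>',
  '<tr><td>Beta</td><td>35</td></tr>',
  '</table>'
)) %>% html_nodes('table tr')

# For every non-header row, collect the trimmed text of its <td> cells
cells <- lapply(toy_rows[-1], function(row) html_text(html_nodes(row, 'td'), trim = TRUE))

# Stack the rows into a data frame and take column names from the <th> cells
toy_df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(toy_df) <- html_text(html_nodes(toy_rows[[1]], 'th'))
toy_df
```

The values are still character strings (e.g., "1,200"), so the commas would need to be removed before converting to numbers.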

View the corona_rows we selected in the previous example:

# View first few rows
head(corona_rows)
## {xml_nodeset (6)}
## [1] <tr>\n<th>COUNTRY</th>\n                          <th>INFECTED</th>\ ...
## [2] <tr>\n<td><a href="https://corona.help/country/united-states">\n     ...
## [3] <tr>\n<td><a href="https://corona.help/country/india">\n             ...
## [4] <tr>\n<td><a href="https://corona.help/country/brazil">\n            ...
## [5] <tr>\n<td><a href="https://corona.help/country/russia">\n            ...
## [6] <tr>\n<td><a href="https://corona.help/country/united-kingdom">\n    ...
corona_rows[1:5] # first five rows
## {xml_nodeset (5)}
## [1] <tr>\n<th>COUNTRY</th>\n                          <th>INFECTED</th>\ ...
## [2] <tr>\n<td><a href="https://corona.help/country/united-states">\n     ...
## [3] <tr>\n<td><a href="https://corona.help/country/india">\n             ...
## [4] <tr>\n<td><a href="https://corona.help/country/brazil">\n            ...
## [5] <tr>\n<td><a href="https://corona.help/country/russia">\n            ...
corona_rows[c(1)] # header row
## {xml_nodeset (1)}
## [1] <tr>\n<th>COUNTRY</th>\n                          <th>INFECTED</th>\ ...
corona_rows[1] # header row
## {xml_nodeset (1)}
## [1] <tr>\n<th>COUNTRY</th>\n                          <th>INFECTED</th>\ ...


Let’s convert this to raw HTML using as.character() to practice writing regular expressions. Refer back to this output to help you determine what pattern you want to match:

# Convert rows to raw HTML
rows <- as.character(corona_rows)[-c(1)] # [-c(1)] means skip header row

# View first few rows as raw HTML
writeLines(head(rows, 2))  # printing via writeLines() is much prettier than printing via print()
## <tr>
## <td><a href="https://corona.help/country/united-states">
##                               <div style="height:100%;width:100%">United States</div>
##                             </a></td>
##                           <td class="text-warning">29,381,220</td>
##                           <td class="text-warning text-bold-700">16,100</td>
##                           <td class="text-danger">529,465</td>
##                           <td class="text-danger text-bold-700">420</td>
##                           <td class="text-success">19,907,106</td>
##                           <td class="text-success text-bold-700">12,214</td>
##                           <td class="text-warning">8,944,649</td>
##                           <td class="text-danger">14,416</td>
##                           <td class="text-warning">362,072,955</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/india">
##                               <div style="height:100%;width:100%">India</div>
##                             </a></td>
##                           <td class="text-warning">11,156,250</td>
##                           <td class="text-warning text-bold-700">16,937</td>
##                           <td class="text-danger">157,471</td>
##                           <td class="text-danger text-bold-700">86</td>
##                           <td class="text-success">10,823,802</td>
##                           <td class="text-success text-bold-700">13,641</td>
##                           <td class="text-warning">174,977</td>
##                           <td class="text-danger">8,944</td>
##                           <td class="text-warning">217,618,057</td>
## 
##                           
##                         </tr>
# Investigate object named `rows`, which is a character vector
typeof(rows)
## [1] "character"
class(rows)
## [1] "character"
length(rows)
## [1] 226
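
To see why writeLines() is easier to read than print() for raw HTML, here is a small sketch on a made-up string (cell is a hypothetical value, not taken from the scraped data):

```r
# Hypothetical string containing double quotes and a newline
cell <- '<td class="text-danger">420</td>\n</tr>'

# print() shows the string with escapes: \" for quotes and \n for the newline
print(cell)

# writeLines() renders the quotes and the line break as they would appear in the HTML
writeLines(cell)
## <td class="text-danger">420</td>
## </tr>
```

This matters when writing regular expressions, because the pattern has to match the actual characters in the string and not the escaped form that print() displays.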

Example: Using str_subset() to subset rows

Let’s filter for rows whose country name starts with 'United'. First, preview what our regular expression matches using str_view_all():

str_view_all(string = head(rows), pattern = 'United \\w+')

Inspect the output from str_detect(), which returns TRUE if there is a match and FALSE if not. For example, we see there is a TRUE for the first element (United States) and fifth element (United Kingdom):

str_detect(string = head(rows), pattern = 'United \\w+')
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

Finally, subset rows by country name using str_subset(), which keeps the elements of a character vector for which str_detect() is TRUE (i.e., keeps elements where the pattern “matches”):

subset_by_country <- str_subset(string = rows, pattern = 'United \\w+')
writeLines(subset_by_country)
## <tr>
## <td><a href="https://corona.help/country/united-states">
##                               <div style="height:100%;width:100%">United States</div>
##                             </a></td>
##                           <td class="text-warning">29,381,220</td>
##                           <td class="text-warning text-bold-700">16,100</td>
##                           <td class="text-danger">529,465</td>
##                           <td class="text-danger text-bold-700">420</td>
##                           <td class="text-success">19,907,106</td>
##                           <td class="text-success text-bold-700">12,214</td>
##                           <td class="text-warning">8,944,649</td>
##                           <td class="text-danger">14,416</td>
##                           <td class="text-warning">362,072,955</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/united-kingdom">
##                               <div style="height:100%;width:100%">United Kingdom</div>
##                             </a></td>
##                           <td class="text-warning">4,200,734</td>
##                           <td class="text-warning text-bold-700">35</td>
##                           <td class="text-danger">123,530</td>
##                           <td class="text-danger text-bold-700">0</td>
##                           <td class="text-success">3,005,720</td>
##                           <td class="text-success text-bold-700">0</td>
##                           <td class="text-warning">1,071,484</td>
##                           <td class="text-danger">1,806</td>
##                           <td class="text-warning">91,520,691</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/united-arab-emirates">
##                               <div style="height:100%;width:100%">United Arab Emirates</div>
##                             </a></td>
##                           <td class="text-warning">399,463</td>
##                           <td class="text-warning text-bold-700">2,692</td>
##                           <td class="text-danger">1,269</td>
##                           <td class="text-danger text-bold-700">16</td>
##                           <td class="text-success">385,587</td>
##                           <td class="text-success text-bold-700">1,589</td>
##                           <td class="text-warning">12,607</td>
##                           <td class="text-danger">0</td>
##                           <td class="text-warning">31,284,394</td>
## 
##                           
##                         </tr>
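
The relationship between str_detect() and str_subset() can be checked on a short vector. This is a minimal sketch using made-up strings (toy and keep are names introduced here):

```r
library(stringr)

# Made-up country names (illustrative only)
toy <- c('United States', 'India', 'United Kingdom', 'Brazil')

# Logical vector: TRUE where the pattern matches
keep <- str_detect(string = toy, pattern = 'United \\w+')
keep
## [1]  TRUE FALSE  TRUE FALSE

# Subsetting with the logical vector gives the same result as str_subset()
same <- identical(toy[keep], str_subset(string = toy, pattern = 'United \\w+'))
same
## [1] TRUE

# The pattern matches anywhere in the string; anchoring with ^ requires
# the match to begin at the start of the string
anchored <- str_subset(string = c('United States', 'The United States'), pattern = '^United \\w+')
anchored
## [1] "United States"
```

In the scraped rows each string begins with <tr>, so ^ would not help there; the pattern works because 'United' followed by a space and a word only appears in the country name cell.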

Example: Using str_extract() to extract link for each row

Since all links follow the same pattern, we can use regex to extract this info:

# Escape the dot (\\.) so it matches a literal period rather than any character
links <- str_extract(string = rows, pattern = 'https://corona\\.help/country/[-a-z]+')

# View first few links
head(links)
## [1] "https://corona.help/country/united-states" 
## [2] "https://corona.help/country/india"         
## [3] "https://corona.help/country/brazil"        
## [4] "https://corona.help/country/russia"        
## [5] "https://corona.help/country/united-kingdom"
## [6] "https://corona.help/country/france"
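
In a regex, an unescaped . matches any character, so it is worth seeing how escaping changes the result. The sketch below uses two made-up strings (the second one deliberately replaces the dot in the domain with an x) and also shows one possible way to pull out only the country part of the link with a lookbehind:

```r
library(stringr)

# Made-up strings (illustrative only)
urls <- c('<a href="https://corona.help/country/united-states">',
          '<a href="https://coronaxhelp/country/india">')

# Unescaped dot: both elements match, because . stands for any character
loose <- str_extract(string = urls, pattern = 'https://corona.help/country/[-a-z]+')
loose
## [1] "https://corona.help/country/united-states"
## [2] "https://coronaxhelp/country/india"

# Escaped dot: only the real domain matches; the other element is NA
strict <- str_extract(string = urls, pattern = 'https://corona\\.help/country/[-a-z]+')
strict
## [1] "https://corona.help/country/united-states"
## [2] NA

# Lookbehind (?<=...) matches only what comes after /country/
slug <- str_extract(string = urls, pattern = '(?<=/country/)[-a-z]+')
slug
## [1] "united-states" "india"
```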

Example: Using str_match() to extract country for each row

Since all countries are in a div element with the same attributes, we can use the following regex to extract the country name:

countries <- str_match(string = rows, pattern = '<div style="height:100%;width:100%">([\\w ]+)</div>')

# View first few countries
# We used a capturing group to extract the country name from between the tags
head(countries)
##      [,1]                                                        
## [1,] "<div style=\"height:100%;width:100%\">United States</div>" 
## [2,] "<div style=\"height:100%;width:100%\">India</div>"         
## [3,] "<div style=\"height:100%;width:100%\">Brazil</div>"        
## [4,] "<div style=\"height:100%;width:100%\">Russia</div>"        
## [5,] "<div style=\"height:100%;width:100%\">United Kingdom</div>"
## [6,] "<div style=\"height:100%;width:100%\">France</div>"        
##      [,2]            
## [1,] "United States" 
## [2,] "India"         
## [3,] "Brazil"        
## [4,] "Russia"        
## [5,] "United Kingdom"
## [6,] "France"
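
Because str_match() returns a matrix, the country names themselves are in the second column. The sketch below, which uses two made-up rows, pulls out that column and shows a limitation of the character class [\\w ]: a name containing a hyphen is not matched and comes back as NA. The broader class [^<]+ is one possible alternative, not the only one:

```r
library(stringr)

# Made-up strings (illustrative only); the second name contains a hyphen
toy <- c('<div style="height:100%;width:100%">United States</div>',
         '<div style="height:100%;width:100%">Guinea-Bissau</div>')

# Column 2 holds the capturing group; the hyphen is not in [\\w ], so no match
names_narrow <- str_match(string = toy, pattern = '<div style="height:100%;width:100%">([\\w ]+)</div>')[, 2]
names_narrow
## [1] "United States" NA

# Broader class: any run of characters up to the next "<"
names_broad <- str_match(string = toy, pattern = '<div style="height:100%;width:100%">([^<]+)</div>')[, 2]
names_broad
## [1] "United States" "Guinea-Bissau"
```

Checking for NA values in the result (e.g., with sum(is.na(...))) is a quick way to find rows the pattern missed.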

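Example: Using str_match_all() to extract the number of deaths and critical cases for each row

Since both the number of deaths and the number of critical cases are in a <td> element whose class attribute is exactly "text-danger", one regex can capture both. Cells with additional classes (e.g., "text-danger text-bold-700") are not matched. We can use the following regex to extract both numbers: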

num_danger <- str_match_all(string = rows, pattern = '<td class="text-danger">([\\d,]+)</td>')

# View matches for first few rows
# We used a capturing group to extract the numbers from between the tags
head(num_danger)
## [[1]]
##      [,1]                                     [,2]     
## [1,] "<td class=\"text-danger\">529,465</td>" "529,465"
## [2,] "<td class=\"text-danger\">14,416</td>"  "14,416" 
## 
## [[2]]
##      [,1]                                     [,2]     
## [1,] "<td class=\"text-danger\">157,471</td>" "157,471"
## [2,] "<td class=\"text-danger\">8,944</td>"   "8,944"  
## 
## [[3]]
##      [,1]                                     [,2]     
## [1,] "<td class=\"text-danger\">257,562</td>" "257,562"
## [2,] "<td class=\"text-danger\">8,318</td>"   "8,318"  
## 
## [[4]]
##      [,1]                                    [,2]    
## [1,] "<td class=\"text-danger\">87,348</td>" "87,348"
## [2,] "<td class=\"text-danger\">2,300</td>"  "2,300" 
## 
## [[5]]
##      [,1]                                     [,2]     
## [1,] "<td class=\"text-danger\">123,530</td>" "123,530"
## [2,] "<td class=\"text-danger\">1,806</td>"   "1,806"  
## 
## [[6]]
##      [,1]                                    [,2]    
## [1,] "<td class=\"text-danger\">87,220</td>" "87,220"
## [2,] "<td class=\"text-danger\">3,586</td>"  "3,586"
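
The list of matrices returned by str_match_all() can be reshaped into numeric columns. The following is a rough sketch on two made-up rows; the column names deaths and critical are labels chosen here, assuming the first match in each row is deaths and the second is critical:

```r
library(stringr)

# Two made-up rows (illustrative only), each with two "text-danger" cells
toy_rows <- c(
  '<td class="text-danger">529,465</td><td class="text-danger text-bold-700">420</td><td class="text-danger">14,416</td>',
  '<td class="text-danger">157,471</td><td class="text-danger text-bold-700">86</td><td class="text-danger">8,944</td>'
)

m <- str_match_all(string = toy_rows, pattern = '<td class="text-danger">([\\d,]+)</td>')

# Helper (hypothetical name): drop the commas, then convert to numeric
to_num <- function(x) as.numeric(str_remove_all(x, ','))

# Column 2 of each matrix is the capturing group; row 1 = first match, row 2 = second match
danger_df <- data.frame(
  deaths   = to_num(sapply(m, function(mat) mat[1, 2])),
  critical = to_num(sapply(m, function(mat) mat[2, 2]))
)
danger_df
##   deaths critical
## 1 529465    14416
## 2 157471     8944
```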

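Example: Using str_replace_all() to convert numeric values to thousands for each row

Rewrite all numeric values of one thousand or more in terms of k. Note that the regex simply drops the last three digits, so values are truncated rather than rounded (e.g., 16,937 becomes 16k):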

num_to_k <- str_replace_all(string = rows, pattern = '>([\\d,]+),\\d{3}<', replacement = '>\\1k<')

# View replacements for first few rows
writeLines(head(num_to_k))
## <tr>
## <td><a href="https://corona.help/country/united-states">
##                               <div style="height:100%;width:100%">United States</div>
##                             </a></td>
##                           <td class="text-warning">29,381k</td>
##                           <td class="text-warning text-bold-700">16k</td>
##                           <td class="text-danger">529k</td>
##                           <td class="text-danger text-bold-700">420</td>
##                           <td class="text-success">19,907k</td>
##                           <td class="text-success text-bold-700">12k</td>
##                           <td class="text-warning">8,944k</td>
##                           <td class="text-danger">14k</td>
##                           <td class="text-warning">362,072k</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/india">
##                               <div style="height:100%;width:100%">India</div>
##                             </a></td>
##                           <td class="text-warning">11,156k</td>
##                           <td class="text-warning text-bold-700">16k</td>
##                           <td class="text-danger">157k</td>
##                           <td class="text-danger text-bold-700">86</td>
##                           <td class="text-success">10,823k</td>
##                           <td class="text-success text-bold-700">13k</td>
##                           <td class="text-warning">174k</td>
##                           <td class="text-danger">8k</td>
##                           <td class="text-warning">217,618k</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/brazil">
##                               <div style="height:100%;width:100%">Brazil</div>
##                             </a></td>
##                           <td class="text-warning">10,647k</td>
##                           <td class="text-warning text-bold-700">0</td>
##                           <td class="text-danger">257k</td>
##                           <td class="text-danger text-bold-700">0</td>
##                           <td class="text-success">9,527k</td>
##                           <td class="text-success text-bold-700">0</td>
##                           <td class="text-warning">863k</td>
##                           <td class="text-danger">8k</td>
##                           <td class="text-warning">28,600k</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/russia">
##                               <div style="height:100%;width:100%">Russia</div>
##                             </a></td>
##                           <td class="text-warning">4,278k</td>
##                           <td class="text-warning text-bold-700">10k</td>
##                           <td class="text-danger">87k</td>
##                           <td class="text-danger text-bold-700">452</td>
##                           <td class="text-success">3,853k</td>
##                           <td class="text-success text-bold-700">15k</td>
##                           <td class="text-warning">337k</td>
##                           <td class="text-danger">2k</td>
##                           <td class="text-warning">111,800k</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/united-kingdom">
##                               <div style="height:100%;width:100%">United Kingdom</div>
##                             </a></td>
##                           <td class="text-warning">4,200k</td>
##                           <td class="text-warning text-bold-700">35</td>
##                           <td class="text-danger">123k</td>
##                           <td class="text-danger text-bold-700">0</td>
##                           <td class="text-success">3,005k</td>
##                           <td class="text-success text-bold-700">0</td>
##                           <td class="text-warning">1,071k</td>
##                           <td class="text-danger">1k</td>
##                           <td class="text-warning">91,520k</td>
## 
##                           
##                         </tr>
## 
## <tr>
## <td><a href="https://corona.help/country/france">
##                               <div style="height:100%;width:100%">France</div>
##                             </a></td>
##                           <td class="text-warning">3,783k</td>
##                           <td class="text-warning text-bold-700">0</td>
##                           <td class="text-danger">87k</td>
##                           <td class="text-danger text-bold-700">0</td>
##                           <td class="text-success">259k</td>
##                           <td class="text-success text-bold-700">0</td>
##                           <td class="text-warning">3,436k</td>
##                           <td class="text-danger">3k</td>
##                           <td class="text-warning">53,496k</td>
## 
##                           
##                         </tr>
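
To see exactly how the backreference \\1 behaves, the same pattern can be applied to a few short strings. This is a minimal sketch with made-up values (cells and out are names introduced here):

```r
library(stringr)

# Made-up cell contents (illustrative only)
cells <- c('>29,381,220<', '>16,100<', '>420<', '>1,999<')

# ([\\d,]+) captures everything before the final ",ddd" group;
# \\1 in the replacement inserts that captured text, followed by "k"
out <- str_replace_all(string = cells, pattern = '>([\\d,]+),\\d{3}<', replacement = '>\\1k<')
out
## [1] ">29,381k<" ">16k<"     ">420<"     ">1k<"
```

Values below one thousand have no comma and are left unchanged, while 1,999 becomes 1k because the last three digits are dropped and not rounded.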